Codecademy Portfolio Project: "OKCupid Date-A-Scientist"

Project Author: Alexander Lacson

!Behind the Scenes!

This article is a behind-the-scenes look at the inspection, cleaning, processing, analysis, interpretation, and modelling of the data. It is geared towards a technical audience. If you want a user-friendly summary, there is a separate article here (Work in Progress). See this project's readme for reproducibility information.

Project Description

In this project, I will work with data from OKCupid, an online dating app. This dataset was provided to me by Codecademy as part of their "Data Science Career Path". In this project I seek to accomplish the following:

Let's begin with inspection of the data.

Inspection

Let's see what these feature values actually look like:

To get a proper idea regarding these essays and columns containing text, let's print out a single user's data.

To find out how long ago this sample was taken, let's look at the range of values of last_online.

"This data is very old. We can only make inference about OKCupid's users during the year 2012!"

Finally, before doing anything to the data, let's see which features have missing values.

Inspection Recap

In this section we have learned the following:

Cleaning and Tidying

The data has to be cleaned and preprocessed before it can be analyzed. Let's start with replacing the '-1' in the income field with NaN, the null value recognized by Pandas and NumPy.
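A minimal sketch of that replacement (the toy frame below is an assumption; only the `income` column name and the `-1` sentinel come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the OKCupid profiles dataframe.
profiles = pd.DataFrame({"income": [-1, 80000, -1, 20000]})

# In this dataset, -1 marks "not disclosed"; replace it with NaN so that
# pandas/NumPy treat it as missing rather than as a (very negative) income.
profiles["income"] = profiles["income"].replace(-1, np.nan)
```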

We can reinspect the missing values visualization to confirm the replacement of the null values.

It appears that income is one of the pieces of information that people would least like to share. Later on, we could make a detailed comparison of the percentage of missing values of each column to evaluate "willingness of users to share information".

Let's move on to the HTML formatted text data. Not only is it more difficult to read, it is also not suitable for Natural Language Processing. Let's clean up the text using an HTML Parser and Regex. I will demonstrate the process when applied to a single entry first.

Before:

After:

Now let's apply this to all of the text in the data.
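As a rough sketch of the cleaning step (stdlib-only; the notebook's own version uses an HTML parser library, so treat the exact patterns here as an assumption):

```python
import html
import re

def clean_essay(raw: str) -> str:
    """Strip HTML tags, unescape entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)       # drop tags like <br /> and <a ...>
    text = html.unescape(text)                # &amp; -> &, etc.
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

clean_essay("i love<br />\nhiking &amp; food")  # -> "i love hiking & food"
```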

Lots of warnings are given, mostly because users include YouTube URLs. It's good to be made aware of this, as the URLs could affect the analysis later.

Cleaning Recap

In this section, we accomplished the following:

Feature Engineering Part 1

Machine learning models can often perform better when given more informative columns, because each column is another point of comparison. Deriving new columns, also called features, from existing data is feature engineering.

I'm going to back up the dataframe in its current form, so that even after modification and addition of columns, we can easily refer to the original data if necessary.

Splitting Columns

We can produce new features by splitting the existing ones. Some of our features are actually describing two variables that are potentially independent of each other. Later on, before we develop our model, we will investigate variable codependence by checking Pearson correlations.

Splitting various columns into two

Columns split in two:

| Original Feature | New Feature 1 | New Feature 2 |
| --- | --- | --- |
| diet | diet_adherence | diet_type |
| location | city | state |
| offspring | offspring_want | offspring_attitude |
| religion | religion_type | religion_attitude |
| sign | sign_type | sign_attitude |
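A sketch of two of these splits, assuming `diet` values shaped like "mostly vegetarian" and `location` values shaped like "san francisco, california":

```python
import pandas as pd

profiles = pd.DataFrame({
    "diet": ["mostly vegetarian", "strictly anything"],
    "location": ["san francisco, california", "oakland, california"],
})

# diet: the first word is the adherence qualifier, the rest is the diet type.
profiles[["diet_adherence", "diet_type"]] = profiles["diet"].str.split(n=1, expand=True)

# location: "city, state" split on the first comma.
profiles[["city", "state"]] = profiles["location"].str.split(", ", n=1, expand=True)
```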

Splitting 'pets'

There are really two kinds of pets among the values: dogs and cats. Let's proceed by splitting 'pets' into 'dogs' and 'cats'.

Having a pet can have nothing to do with liking a pet. Let's further split this into 'dog_preference', 'has_dogs', 'cat_preference', 'has_cats'. Let's also remove 'dogs' and 'cats'.
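The first split might be sketched like this, assuming `pets` values shaped like "likes dogs and likes cats" or "has dogs" (the exact phrasing of the dataset's values is an assumption here):

```python
import numpy as np
import pandas as pd

profiles = pd.DataFrame({
    "pets": ["likes dogs and likes cats", "has dogs", "dislikes cats"],
})

def pet_split(value, animal):
    """Return the fragment of a 'pets' value that mentions the given animal."""
    if pd.isna(value):
        return np.nan
    parts = [p for p in value.split(" and ") if animal in p]
    return parts[0] if parts else np.nan

profiles["dogs"] = profiles["pets"].apply(pet_split, animal="dogs")
profiles["cats"] = profiles["pets"].apply(pet_split, animal="cats")
```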

Splitting 'speaks'

There's quite some unpacking that needs to be done here. There appear to be several different languages and different options for fluency. Before we start making new columns, let's get a better sense for what exactly our values are.

After isolating the terms, it is revealed that the 'speaks' values contain 77 different languages and four different descriptors of language fluency. The ambiguity of the fluency options presents a dilemma: what is 'afrikaans' supposed to mean compared to 'afrikaans (okay)'? Because of this ambiguity, we will not use the fluency descriptors in our visualization. We will create a new column for each language, containing a 1 if the language is spoken and a 0 if not. The result is called a sparse matrix: 'sparse' because it contains many more 0s than 1s.
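That expansion can be sketched as follows, assuming `speaks` values like "english, spanish (okay)", where the parenthesized fluency descriptor is stripped before encoding:

```python
import pandas as pd

profiles = pd.DataFrame({
    "speaks": ["english, spanish (okay)", "english (fluently), french"],
})

# Drop the "(...)" fluency descriptors, then one-hot encode on the comma separator.
languages = (
    profiles["speaks"]
    .str.replace(r" \([^)]*\)", "", regex=True)
    .str.get_dummies(sep=", ")
)
```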

Let's define a function to print memory usage, and investigate how much memory the sparse matrix takes up.

Converting our sparse matrix to a pandas sparse format reduces its memory usage and reduces the time machine learning algorithms take to train on it.
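The memory saving can be demonstrated on synthetic data (sizes here are illustrative, not the dataset's):

```python
import pandas as pd

# A mostly-zero matrix standing in for the language indicator columns.
dense = pd.DataFrame(0, index=range(10_000), columns=[f"lang_{i}" for i in range(50)])
dense.iloc[:, 0] = 1  # one common language; everything else stays zero

# Convert to pandas' sparse dtype: only the non-fill (non-zero) values are stored.
sparse = dense.astype(pd.SparseDtype("int64", 0))

print(dense.memory_usage(deep=True).sum())   # dense footprint in bytes
print(sparse.memory_usage(deep=True).sum())  # far smaller
```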

Adding columns

There are other ways to derive new features in addition to splitting.

Added Columns:

| New Feature | Description |
| --- | --- |
| num_ethnicities | Contains the number of ethnicities listed in 'ethnicity' |
| optional_%unfilled | Percentage of optional fields unfilled |
| num_languages | Count of languages spoken |
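A sketch of how such columns might be derived (the toy frame and the choice of which columns count as "optional" are assumptions):

```python
import pandas as pd

profiles = pd.DataFrame({
    "ethnicity": ["asian, white", "black", None],
    "speaks": ["english", "english, french", "english"],
})

# Count comma-separated entries; missing values count as zero ethnicities.
profiles["num_ethnicities"] = (
    profiles["ethnicity"].str.count(",").add(1).fillna(0).astype(int)
)
profiles["num_languages"] = profiles["speaks"].str.count(",") + 1

# Percentage of optional fields left unfilled, per user (here both columns
# are treated as optional purely for the sake of the sketch).
optional = ["ethnicity", "speaks"]
profiles["optional_%unfilled"] = profiles[optional].isna().mean(axis=1) * 100
```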

Feature Engineering Part 1 Recap

In this section, we accomplished the following:

Columns split in two:

| Original Feature | New Feature 1 | New Feature 2 |
| --- | --- | --- |
| diet | diet_adherence | diet_type |
| location | city | state |
| offspring | offspring_want | offspring_attitude |
| religion | religion_type | religion_attitude |
| sign | sign_type | sign_attitude |

Added columns:

| New Feature | Description |
| --- | --- |
| num_ethnicities | Contains the number of ethnicities listed in 'ethnicity' |
| optional_%unfilled | Percentage of optional fields unfilled |
| num_languages | Count of languages spoken |

Visualization

Numerical Features

The best way to explore data is to visualize it. Let's start by generating histograms and boxplots of our numerical features.

Use the dropdown selector to switch between features. Datapoints show more information on mouseover. The graph can be panned and zoomed.

| Feature | Comment |
| --- | --- |
| Age | The median age is 30. The distribution is right-skewed. Most users are young and working adults. |
| Height | Most heights range between 59 and 78 in (4.9 - 6.5 ft), with a median of 68 in (5.67 ft). The distribution appears normal. |
| Income | The median income is \$50k. A quarter of all incomes are at \$25k or below. Most income distributions are log-normal (if the ultra-wealthy are not included), and this is no different. |
| All Features Above | From the box plot, we can see that all of the distributions have outliers: a 4.5% group of \$1M earners, two people over 100 years old, and a height of 1 in. Below, we will inspect these data points (user profiles) to see what's really going on. |
| optional_%unfilled | The y-axis shows the percentage of users and the x-axis shows the percentage of optional user information fields left unfilled. The interquartile range is 12% - 30%, meaning half of all users leave 12% - 30% of the optional fields unfilled; a quarter of users fall below that range and a quarter above it. This is a feature that was engineered from the given raw data. |

Outlier Inspection

Age

Let's take a look at the profiles of our centenarian users.

The first one has 95% of optional fields unfilled. The second has 54% unfilled and a claimed height of 95 inches (nearly 8 ft). Do you think these profiles are reliable?

Height

Below is an example of the kind of user profile which 'has a height of 1 inch'.

Income

Below is the profile of someone who has an income of \$1M.

Graduated from space camp, complains about OKCupid picture takedown in essay0, nine ethnicities, five languages, 5' 10" in height. Do you think this income is reliable?

Decision on Outliers

We will remove outliers for age and height. Outliers have the potential to greatly increase memory usage, variance, and training time. The current objective is to make a predictive model that works. If that objective is accomplished, we have the option of reiterating on this project to incorporate outliers.

Categorical Features

Interactive Treemap (only interactive if opened with Jupyter Notebooks and required packages are installed)

An interactive tool for visualizing the categorical features as a treemap. The dropdown selector allows us to choose the feature to display, and the tickbox allows us to choose whether to include NaN values in the treemap.

Static Image Preview

image.png

Dashboard style donut plot grid of variables

Bias:

The biggest indicator that this sample does not meet the statistical criteria for independent random sampling is the 'state/country' variable. OKCupid was founded in 2004. It's simply not possible that by 2012 all of their users would only be from the state of California. When you get a dataset, regardless of what you've been told about the quality of the sampling, always check for signs of bias. The process of checking for bias in a study or in a sample is sometimes called a "Risk of Bias Assessment".

Even though the sample looks like it's heavily biased, we will still draw inference about what is represented in the sample.

The charts tell us that the typical profile on OKCupid back in 2012 was:

It's possible that people misrepresent themselves on their profile, paint themselves more positively, and carefully omit negative information.

Top Spoken Languages

Everyone speaks English. Hilariously some people speak C++... why not Python?

Unfilled Optional Fields Sorted

Disclaimer: My comments below are pure speculation and hypothesis

| Feature | Hypothesis for not sharing information |
| --- | --- |
| Income | If you're rich, you don't want the IRS to know. If you're poor, you don't want potential matches to see that either. |
| Children | Finding out someone has kids can be a turn-off and perceived as extra baggage. |
| Diet | People are afraid of being criticized for choosing to shun some foods. |
| Religion | Some people are xenophobic. As a result, some users hide their religion so as not to immediately turn away those xenophobes. |
| Pets | No idea why a third of users don't share this information. |
| Essays | There's a pattern: the higher up the chart you go, the higher the essay number. The questions are presented to the user in a fixed order, and not all users have the patience to answer all the way to the last one. Essay 8 stands out significantly more than the rest because its question is "Share something private", which is understandably sensitive, considering you can't take back what you share online. See the section below for the essay questions. |
| Drugs | Drugs are illegal in some states. |

Essay Questions

The essay questions are a perfect candidate for Natural Language Processing (NLP) topic modelling. More specifically, we'll use term frequency - inverse document frequency (tf-idf), a model which first counts the occurrences of each word, then applies a weighting scheme that deprioritizes common words such as "the". The expected result is to condense the essay answers into specific keywords, which can then be visualized.

Let's create a copy of all of the essay answers, just in case we need to start over in the preprocessing for NLP.

Text Preprocessing

Before we can apply tf-idf to our essay answers, we need to convert them to a suitable format. This process is called text preprocessing. The words of our essay answers will be converted to their root words (also called lemmas).
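A toy illustration of the idea (a real pipeline would use a proper lemmatizer such as NLTK's WordNetLemmatizer or spaCy; the suffix rules below are a deliberately simplified stand-in):

```python
import re

def naive_lemma(word: str) -> str:
    """Very rough root-word reduction: lowercase and strip common suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list:
    """Tokenize on alphabetic runs and reduce each token to its 'root'."""
    return [naive_lemma(t) for t in re.findall(r"[a-zA-Z]+", text)]

preprocess("cooked dinners")  # -> ["cook", "dinner"]
```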

Term Frequency-Inverse Document Frequency

Now that we have our root words, let's apply tf-idf. It will assign a score to each root word. For each user's answer we will get the highest scoring word (keyword). This keyword represents what our model believes is the most significant word of a user's response.
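In spirit, the keyword extraction works like this (a hand-rolled tf-idf sketch on toy answers; the notebook itself would typically use a vectorizer such as scikit-learn's TfidfVectorizer):

```python
import math
from collections import Counter

docs = [
    "i love hiking hiking and food",
    "i love love movies and food",
    "movies movies movies",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many answers does each word appear?
df = Counter(w for doc in tokenized for w in set(doc))

def top_keyword(doc):
    """Highest tf-idf scoring word in one user's answer."""
    tf = Counter(doc)
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return max(scores, key=scores.get)

print([top_keyword(doc) for doc in tokenized])  # -> ['hiking', 'love', 'movies']
```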

Iterate over all essay columns

Let's iterate the whole process over all of our essay features, and let's collect only the most common keywords.

Top Keywords in User Responses to Essay Questions Visualized as Wordclouds

Let's generate wordclouds for each of the essay questions.
Developer Note: The code is embedded as an image because it was run in a separate Python 3.7 environment.

image-3.png

essay_wordclouds.png

The level of insight gained from each wordcloud, using the existing model algorithm, is not the same. Some give sufficiently interesting and useful results. Some tell us more about what essay question is being asked rather than what the answers to those questions are. The results merit further filtering, tweaking, and refinement of the algorithm to give us better keywords. Further NLP modelling and analysis deserves to be discussed in its own lengthy separate article/notebook. We will not explore it further here.

In addition to the wordclouds, manual reading of several user responses was done to better interpret the tf-idf results.

Interpretation of tf-idf Results

| Essay Number | Personal best guess of the question asked | Comment on tf-idf result |
| --- | --- | --- |
| essay 0 | Describe yourself | Users use this essay question to talk about what they like, love, and the qualities of the person they're looking for |
| essay 1 | What are you currently doing? | Many of the smaller words are clear answers to the question, while the biggest words are a little more difficult to interpret or could be weighted to have a lower score |
| essay 2 | What are you good at? | Users say they're good at listening and that they have a great smile and laugh |
| essay 3 | Describe yourself physically | Users talk about their eyes, hair, smile, and height |
| essay 4 | What are your favorite books, movies, TV shows, music, food, etc.? | Not a very meaningful result. More useful for guessing the essay question. |
| essay 5 | You can't live without... | Users cannot live without their cellphone, money, gym, job, god, sports, and fun. Not sure, though, what it means that 'good' is the top keyword |
| essay 6 | What do you think about? | Users think about life and the future |
| essay 7 | What is a typical Friday night for you? | Users are with their friends on Friday nights |
| essay 8 | Share something private | Users say 'message/ask me about private things and I'll share them with you, but I won't share them here on my public profile'. From the previous section, we can also see that this is the least answered essay question. |
| essay 9 | You would like me if... | Doesn't seem like a very meaningful result |

Further NLP Practice

There is still additional NLP analysis that we will not explore here but can definitely be applied. For example, we can see in essay0 the top keywords are 'love' and 'like'. What do users really mean when they use the word 'love'? Is it 'making love' or 'looking for love'? Are they using 'love' and 'like' interchangeably?

Additional NLP:

Visualization Recap

In this section, we visualized the following:

From our Numerical Features we learned:

| Feature | Comment |
| --- | --- |
| Age | The median age is 30. The distribution is right-skewed. Most users are young and working adults. |
| Height | Most heights range between 59 and 78 in (4.9 - 6.5 ft), with a median of 68 in (5.67 ft). The distribution appears normal. |
| Income | The median income is \$50k. A quarter of all incomes are at \$25k or below. Most income distributions are log-normal (if the ultra-wealthy are not included), and this is no different. |
| optional_%unfilled | Half of all users leave 12% - 30% of optional fields unfilled; a quarter of users fall below that range and a quarter above it. |

From our Categorical Features we learned the stereotypical profile is:

We also discovered that our sample has a high risk of bias. Although OKCupid was founded in 2004, in our 2012 sample practically everyone lives in California, and half of those in the City of San Francisco.

With the use of bar plots we learned:

From our Essay Features we learned:

| Essay Number | Personal best guess of the question asked | Comment on tf-idf result |
| --- | --- | --- |
| essay 0 | Describe yourself | Users use this essay question to talk about what they like, love, and the qualities of the person they're looking for |
| essay 1 | What are you currently doing? | Many of the smaller words are clear answers to the question, while the biggest words are a little more difficult to interpret or could be weighted to have a lower score |
| essay 2 | What are you good at? | Users say they're good at listening and that they have a great smile and laugh |
| essay 3 | Describe yourself physically | Users talk about their eyes, hair, smile, and height |
| essay 4 | What are your favorite books, movies, TV shows, music, food, etc.? | Not a very meaningful result. More useful for guessing the essay question. |
| essay 5 | You can't live without... | Users cannot live without their cellphone, money, gym, job, god, sports, and fun. Not sure, though, what it means that 'good' is the top keyword |
| essay 6 | What do you think about? | Users think about life and the future |
| essay 7 | What is a typical Friday night for you? | Users are with their friends on Friday nights |
| essay 8 | Share something private | Users say 'message/ask me about private things and I'll share them with you, but I won't share them here on my public profile'. From the previous section, we can also see that this is the least answered essay question. |
| essay 9 | You would like me if... | Doesn't seem like a very meaningful result |

In this section, we made the decision to strip outliers of age and height:

Outliers have the potential to increase memory usage, variance, and training time. The current objective is to make a predictive model that works. If that objective is accomplished, we have the option of reiterating on this project to incorporate outliers.

Feature Engineering Part 2

We need to do even more feature engineering before we go into Machine Learning. All our features have to be properly formatted and expanded/encoded. Any value that is not a number will not be understood by our ML training algorithm.

We will make a backup copy of our data at this point, so that we can reset back to this checkpoint if we want to undo any modifications. Let's also drop our languages sparse matrix. Previously we didn't include the fluency descriptors in our sparse matrix for our visualization. Later, when one-hot encoding, we will make a new sparse matrix for languages which contains the fluency descriptors.

Let's convert last_online from a string to a datetime format and split it up.
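A sketch of that conversion, assuming the raw strings look like "2012-06-28-20-30" (year-month-day-hour-minute):

```python
import pandas as pd

profiles = pd.DataFrame({"last_online": ["2012-06-28-20-30", "2012-05-01-09-15"]})

# Parse the custom timestamp format, then split it into numeric parts.
dt = pd.to_datetime(profiles["last_online"], format="%Y-%m-%d-%H-%M")
profiles["last_online_year"] = dt.dt.year
profiles["last_online_month"] = dt.dt.month
profiles["last_online_day"] = dt.dt.day
profiles["last_online_hour"] = dt.dt.hour
```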

Let's drop the columns that we created splits from. If we need to use any of them as target variables for prediction, we can recover them from our backup dataframe. Let's drop income, because 80% of its values are missing. Let's also drop num_ethnicities and optional_%unfilled.

Let's apply one-hot encoding to our categorical variables. We will also encode a category to represent the null values of each feature, so that our ML model will include the user's decisions to share particular information when making predictions.
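One way to sketch this in pandas (the toy column is an assumption; `dummy_na=True` is what gives each feature its "missing" category):

```python
import pandas as pd

profiles = pd.DataFrame({"drinks": ["socially", None, "often"]})

# dummy_na=True adds a column for missing values, so "chose not to answer"
# becomes a signal the model can use rather than a hole in the data.
encoded = pd.get_dummies(profiles, columns=["drinks"], dummy_na=True)
```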

Let's make our dataframe's datatypes uniform. Just ignore the datatype of the essays for now (the ten object columns).

We are now ready to move on to Machine Learning.

Feature Engineering Part 2 Recap

In this Section we accomplished the following:

Machine Learning to Predict Gender

We will evaluate and compare two different Machine Learning Models to predict gender.

Now is a good time to backup the dataframe.

Gender Classification with a Logistic Regression Model
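In outline, the approach looks something like this (toy synthetic data; the real notebook trains on the encoded profile features, and comparing raw coefficient magnitudes as done here assumes comparably scaled features):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: a height feature and a one-hot style body-type feature.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "height": np.concatenate([rng.normal(70, 3, 200), rng.normal(64, 3, 200)]),
    "body_type_curvy": np.concatenate([np.zeros(200), rng.integers(0, 2, 200)]),
})
y = np.array(["m"] * 200 + ["f"] * 200)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Rank features by the magnitude of their coefficients.
importance = pd.Series(model.coef_[0], index=X.columns).abs().sort_values(ascending=False)
print(importance)
```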

Height and body_type_curvy are our top predictors, probably because men are taller than women on average, and because men are unlikely to describe themselves as curvy, whereas women are.

AI Ethics: A model like this highlights the ethical consideration we must take when developing AI. For example, one of the predictors is job_computer / hardware / software. This could be misused to discriminate by gender, proclaiming that one gender is not fit for working in a technical computer job. Never be hasty to deploy a model that will be used on people.

Gender Classification with a Decision Tree Model
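A comparable sketch for the tree model (the same kind of toy stand-in data; this is illustrative, not the notebook's actual training code):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, as in the logistic regression sketch.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "height": np.concatenate([rng.normal(70, 3, 200), rng.normal(64, 3, 200)]),
    "body_type_curvy": np.concatenate([np.zeros(200), rng.integers(0, 2, 200)]),
})
y = np.array(["m"] * 200 + ["f"] * 200)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# Decision trees expose a built-in notion of feature importance.
print(pd.Series(tree.feature_importances_, index=X.columns))
```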

Machine Learning to Predict Gender Recap

In this section we accomplished the following:

Date Recommendation using K-Means Clustering

Clustering is a technique which groups similar data points together. Let's use it to group similar people together and recommend who you should date. People in the same cluster as you are the ones we will recommend.

Feature Selection

We start by reviewing the features we have on hand. Which, among the features, would you want your date to have in common with you?

After asking someone which features they would want their date to have in common with them, the chosen features are:

Let's isolate that subset of features.

Choosing a value of k

Now that we have our subset let's search for the best number of k clusters to use in our model.

The Inertia vs. k graph below took 5 hours to produce (hence the default setting above to skip execution of the code snippet). Inertia is a metric which represents how spread out the points of a cluster are relative to its centroid. A line has been drawn over the tail end of the graph so that we can clearly mark where the graph becomes linear. The point where the graph becomes linear is known as the elbow point, and it marks the number of clusters we should use.
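The search loop behind such a graph can be sketched on toy data with a small range of k (the notebook's own run sweeps a far larger range, hence the 5 hours):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))  # toy stand-in for the encoded feature subset

# Fit K-Means for a range of k and record the inertia of each fit.
inertias = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Inertia shrinks as k grows; the "elbow" is where the drop levels off.
```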

K_Clusters_evaluation_annotated-2.png

k = 100 is where the linearity begins. Let's create the model we will use for clustering with k = 100.
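Assigning clusters and pulling a recommendation might then look like this (k is shrunk to fit the toy data, and the feature names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
features = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "f4"])

# Fit the final model and label every user with a cluster number.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(features)
features["cluster"] = kmeans.labels_

# Recommend a random profile from the same cluster as a given user (user 0).
user_cluster = kmeans.predict(features.iloc[[0], :4])[0]
candidates = features[features["cluster"] == user_cluster].drop(index=0)
recommendation = candidates.sample(1, random_state=0)
```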

Let's investigate the uniformity of our cluster distribution.

The clusters are somewhat distributed across users.

Date Match Recommendations (only interactive if opened with Jupyter Notebooks and required packages are installed)

Let's discover who you can date! Select from the dropdown lists to enter your information. Click the 'Run Interact' button to predict your cluster number and display the profile of a potential date. Your cluster contains users that are similar to you - someone you might want to date! Click Run to see another random profile from your cluster.

Static Image Preview

image.png

Date Recommendation using K-Means Clustering Recap

In this section, we accomplished the following:

"}, "metadata": {}, "output_type": "display_data"}]}}, "aab248a1765c4572a162dd3e8eeb8950": {"model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {}}, "b950b0a9473d4e8692d029522f8bd064": {"model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": {"description_width": ""}}, "c0a085fffaa145e4ac94b4bbf2ceb3c1": {"model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {}}, "ca7ed6e15c26477d99d5ca3e3f77d193": {"model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "CheckboxModel", "state": {"description": "dropna", "disabled": false, "layout": "IPY_MODEL_65026f0a84324a1c8077b22b061228ee", "style": "IPY_MODEL_ed2cf6e6e83040bcb6b089de1c1ae505", "value": false}}, "d3ae06ff94ed431885c36b19d9267ea8": {"model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {}}, "d3b1a7102a414c84a4be43390b161454": {"model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DropdownModel", "state": {"_options_labels": ["body_type", "diet", "drinks", "drugs", "education", "ethnicity", "job", "location", "offspring", "orientation", "pets", "religion", "sex", "sign", "smokes", "speaks", "status", "diet_adherence", "diet_type", "city", "state/country", "offspring_want", "offspring_attitude", "religion_type", "religion_attitude", "sign_type", "sign_attitude", "dog_preference", "cat_preference", "has_dogs", "has_cats", "afrikaans", "albanian", "ancient greek", "arabic", "armenian", "basque", "belarusan", "bengali", "breton", "bulgarian", "c++", "catalan", "cebuano", "chechen", "chinese", "croatian", "czech", "danish", "dutch", "english", "esperanto", "estonian", "farsi", "finnish", "french", "frisian", "georgian", "german", "greek", "gujarati", "hawaiian", "hebrew", "hindi", "hungarian", 
"icelandic", "ilongo", "indonesian", "irish", "italian", "japanese", "khmer", "korean", "latin", "latvian", "lisp", "lithuanian", "malay", "maori", "mongolian", "norwegian", "occitan", "other", "persian", "polish", "portuguese", "romanian", "rotuman", "russian", "sanskrit", "sardinian", "serbian", "sign language", "slovak", "slovenian", "spanish", "swahili", "swedish", "tagalog", "tamil", "thai", "tibetan", "turkish", "ukrainian", "urdu", "vietnamese", "welsh", "yiddish", "num_ethnicities", "optional_%unfilled", "num_languages"], "description": "feature", "index": 10, "layout": "IPY_MODEL_d3ae06ff94ed431885c36b19d9267ea8", "style": "IPY_MODEL_71dc1ab0b7b548cfb4667d6c75b881f8"}}, "defdf3b1605b47ef84c1c4aa6471d94f": {"model_module": "@jupyter-widgets/output", "model_module_version": "1.0.0", "model_name": "OutputModel", "state": {"layout": "IPY_MODEL_36fefcb2c3d548db918408f324f8bb43", "outputs": [{"data": {"application/vnd.plotly.v1+json": {"config": {"plotlyServerURL": "https://plot.ly"}, "data": [{"domain": {"x": [0, 1], "y": [0, 1]}, "hovertemplate": "label=%{label}
Percent=%{color}
parent=%{parent}", "labels": ["average", "fit", "athletic", "nan", "thin", "curvy", "a little extra", "skinny", "full figured", "overweight", "jacked", "used up", "rather not say"], "marker": {"coloraxis": "coloraxis", "colors": [24.441997798018217, 21.20408367530778, 19.716077803356356, 8.834617822707102, 7.858739532245687, 6.545891302171955, 4.385613719013779, 2.9643345677776667, 1.683181530043706, 0.740666599939946, 0.7022987355286424, 0.5921996463483802, 0.3302972675407867], "showscale": false}, "name": "", "parents": ["body_type", "body_type", "body_type", "body_type", "body_type", "body_type", "body_type", "body_type", "body_type", "body_type", "body_type", "body_type", "body_type"], "texttemplate": "%{label}
%{value:.2f%}%", "type": "treemap", "values": [24.441997798018217, 21.20408367530778, 19.716077803356356, 8.834617822707102, 7.858739532245687, 6.545891302171955, 4.385613719013779, 2.9643345677776667, 1.683181530043706, 0.740666599939946, 0.7022987355286424, 0.5921996463483802, 0.3302972675407867]}], "layout": {"coloraxis": {"colorbar": {"title": {"text": "Percent"}}, "colorscale": [[0, "rgb(247,251,255)"], [0.125, "rgb(222,235,247)"], [0.25, "rgb(198,219,239)"], [0.375, "rgb(158,202,225)"], [0.5, "rgb(107,174,214)"], [0.625, "rgb(66,146,198)"], [0.75, "rgb(33,113,181)"], [0.875, "rgb(8,81,156)"], [1, "rgb(8,48,107)"]]}, "legend": {"tracegroupgap": 0}, "template": {"data": {"bar": [{"error_x": {"color": "#2a3f5f"}, "error_y": {"color": "#2a3f5f"}, "marker": {"line": {"color": "#E5ECF6", "width": 0.5}}, "type": "bar"}], "barpolar": [{"marker": {"line": {"color": "#E5ECF6", "width": 0.5}}, "type": "barpolar"}], "carpet": [{"aaxis": {"endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f"}, "baxis": {"endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f"}, "type": "carpet"}], "choropleth": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "type": "choropleth"}], "contour": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "colorscale": [[0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1, "#f0f921"]], "type": "contour"}], "contourcarpet": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "type": "contourcarpet"}], "heatmap": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "colorscale": [[0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], 
[0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1, "#f0f921"]], "type": "heatmap"}], "heatmapgl": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "colorscale": [[0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1, "#f0f921"]], "type": "heatmapgl"}], "histogram": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "histogram"}], "histogram2d": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "colorscale": [[0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1, "#f0f921"]], "type": "histogram2d"}], "histogram2dcontour": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "colorscale": [[0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1, "#f0f921"]], "type": "histogram2dcontour"}], "mesh3d": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "type": "mesh3d"}], "parcoords": [{"line": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "parcoords"}], "pie": [{"automargin": true, "type": "pie"}], "scatter": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scatter"}], "scatter3d": [{"line": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scatter3d"}], "scattercarpet": [{"marker": 
{"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scattercarpet"}], "scattergeo": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scattergeo"}], "scattergl": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scattergl"}], "scattermapbox": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scattermapbox"}], "scatterpolar": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scatterpolar"}], "scatterpolargl": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scatterpolargl"}], "scatterternary": [{"marker": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "type": "scatterternary"}], "surface": [{"colorbar": {"outlinewidth": 0, "ticks": ""}, "colorscale": [[0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1, "#f0f921"]], "type": "surface"}], "table": [{"cells": {"fill": {"color": "#EBF0F8"}, "line": {"color": "white"}}, "header": {"fill": {"color": "#C8D4E3"}, "line": {"color": "white"}}, "type": "table"}]}, "layout": {"annotationdefaults": {"arrowcolor": "#2a3f5f", "arrowhead": 0, "arrowwidth": 1}, "autotypenumbers": "strict", "coloraxis": {"colorbar": {"outlinewidth": 0, "ticks": ""}}, "colorscale": {"diverging": [[0, "#8e0152"], [0.1, "#c51b7d"], [0.2, "#de77ae"], [0.3, "#f1b6da"], [0.4, "#fde0ef"], [0.5, "#f7f7f7"], [0.6, "#e6f5d0"], [0.7, "#b8e186"], [0.8, "#7fbc41"], [0.9, "#4d9221"], [1, "#276419"]], "sequential": [[0, "#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1, "#f0f921"]], "sequentialminus": [[0, 
"#0d0887"], [0.1111111111111111, "#46039f"], [0.2222222222222222, "#7201a8"], [0.3333333333333333, "#9c179e"], [0.4444444444444444, "#bd3786"], [0.5555555555555556, "#d8576b"], [0.6666666666666666, "#ed7953"], [0.7777777777777778, "#fb9f3a"], [0.8888888888888888, "#fdca26"], [1, "#f0f921"]]}, "colorway": ["#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52"], "font": {"color": "#2a3f5f"}, "geo": {"bgcolor": "white", "lakecolor": "white", "landcolor": "#E5ECF6", "showlakes": true, "showland": true, "subunitcolor": "white"}, "hoverlabel": {"align": "left"}, "hovermode": "closest", "mapbox": {"style": "light"}, "paper_bgcolor": "white", "plot_bgcolor": "#E5ECF6", "polar": {"angularaxis": {"gridcolor": "white", "linecolor": "white", "ticks": ""}, "bgcolor": "#E5ECF6", "radialaxis": {"gridcolor": "white", "linecolor": "white", "ticks": ""}}, "scene": {"xaxis": {"backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white"}, "yaxis": {"backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white"}, "zaxis": {"backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white"}}, "shapedefaults": {"line": {"color": "#2a3f5f"}}, "ternary": {"aaxis": {"gridcolor": "white", "linecolor": "white", "ticks": ""}, "baxis": {"gridcolor": "white", "linecolor": "white", "ticks": ""}, "bgcolor": "#E5ECF6", "caxis": {"gridcolor": "white", "linecolor": "white", "ticks": ""}}, "title": {"x": 0.05}, "xaxis": {"automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": {"standoff": 15}, "zerolinecolor": "white", "zerolinewidth": 2}, "yaxis": {"automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": {"standoff": 15}, 
"zerolinecolor": "white", "zerolinewidth": 2}}}, "title": {"text": "2012 OKCupid Profiles"}}}, "text/html": "
"}, "metadata": {}, "output_type": "display_data"}]}}, "ed2cf6e6e83040bcb6b089de1c1ae505": {"model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": {"description_width": ""}}}, "version_major": 2, "version_minor": 0}